A Study of K-Nearest Neighbour as an Imputation Method
Authors
Abstract
Data quality is a major concern in Machine Learning and related areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence of missing data, many Machine Learning algorithms handle it in a rather naive way. Missing data treatment should be carefully thought out; otherwise, bias might be introduced into the knowledge induced. In this work, we analyse the use of the k-nearest neighbour algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
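To make the idea of k-nearest neighbour imputation concrete, here is a minimal sketch in Python. It is not the procedure evaluated in the paper. It assumes purely numeric attributes, marks missing values with `None`, measures distance only over attributes observed in both rows, and fills each gap with the mean of the k nearest donors. The names `distance` and `knn_impute` are hypothetical, chosen for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance over attributes observed in both rows (None = missing)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no attribute in common: treat as infinitely far apart
    # Average the squared differences so that pairs sharing fewer
    # attributes remain comparable with pairs sharing more.
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Return a copy of rows in which each None is replaced by the mean of
    that attribute over the k nearest rows where the attribute is observed.
    Assumption: all attributes are numeric."""
    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have attribute j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            if not donors:
                continue  # nothing to impute from; leave the gap in place
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


if __name__ == "__main__":
    data = [
        [1.0, 2.0, 3.0],
        [1.1, 2.1, 3.2],
        [0.9, 1.9, None],   # third attribute is missing
        [8.0, 9.0, 10.0],
    ]
    print(knn_impute(data, k=2))
```

In this toy data set the incomplete row is closest to the first two rows, so with `k=2` the gap is filled with the mean of their third attribute (about 3.1), and the distant fourth row has no influence. Distances are always computed on the original rows, so imputed values never feed into later imputations.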
Similar resources
P. Jönsson and C. Wohlin, "Benchmarking K-nearest Neighbour Imputation with Homogeneous Likert Data", Empirical Software Engineering
Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (kNN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by ...
An Analysis of Four Missing Data Treatment Methods for Supervised Learning
One relevant problem in data quality is the presence of missing data. Despite the frequent occurrence and the relevance of missing data problem, many Machine Learning algorithms handle missing data in a rather naive way. However, missing data treatment should be carefully thought, otherwise bias might be introduced into the knowledge induced. In this work we analyse the use of the k-nearest nei...
Frequency Ratio: a method for dealing with missing values within nearest neighbour search
In this paper we introduce the Frequency Ratio (FR) method for dealing with missing values within nearest neighbour search. We test the FR method on known medical datasets from the UCI machine learning repository. We compare the accuracy of the FR method with five commonly used methods (three “imputation” and two “bypassing” methods) for dealing with values that are “missing completely at rando...
Missing Data Imputation Based on Grey System Theory
This paper proposed a new weighted KNN data filling algorithm based on grey correlation analysis (GBWKNN) by researching the nearest neighbor of missing data filling method. It is aimed at that missing data is not sensitive to noise data and combined with grey system theory and the advantage of the K nearest neighbor algorithm. The experimental results on six UCI data sets showed that its filli...
A Short Note on Using Multiple Imputation Techniques for Very Small Data Sets
This short note describes a simple experiment to investigate the value of using multiple imputation (MI) methods [2, 3]. We are particularly interested in whether a simple bootstrap based on a k-nearest neighbour (kNN) method can help address the problem of missing values in two very small, but typical, software project data sets. This is an important question because, unfortunately, many real-...
Publication date: 2002